SoundCloud Analysis

Author

Hajara Muzammal

Introduction:

Music discovery increasingly occurs through playlists, where listeners curate and share collections of songs across platforms such as SoundCloud. Understanding which musical characteristics are associated with popularity and playlist inclusion can provide insight into listener preferences and emerging trends. This analysis focuses on tracks that include SoundCloud links, allowing us to study how audio features, sentiment, and popularity metrics relate to playlist usage.

Using a dataset of approximately 15,000 tracks with audio features, playlist metadata, and SoundCloud URLs, this report explores the relationship between popularity and musical characteristics such as danceability, energy, tempo, and key. The goal is not to predict popularity with a formal statistical model, but rather to visually and descriptively identify patterns that distinguish popular songs from less popular ones.

This work contributes to the broader question of how music spreads across user-generated platforms and provides a data-driven perspective on what makes songs more likely to appear in playlists.

Overarching Question:

What factors influence the popularity of songs across major music streaming platforms?

My Question:

Is track popularity associated with playlist inclusion?

Data Ingest

We use a publicly available dataset hosted on Hugging Face, which contains playlist metadata, song characteristics, and direct SoundCloud links.

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

url <- "https://huggingface.co/datasets/Zuru7/Spotify_Songs_with_SoundCloud_links/resolve/main/song_df_normalised.csv"
SONGS_raw <- read_csv(url, show_col_types = FALSE)

# Standardize names (works even if you re-run the doc)
SONGS <- SONGS_raw %>%
  rename(
    track          = any_of(c("track", "track_name")),
    artist         = any_of(c("artist", "track_artist")),
    album          = any_of(c("album", "track_album_name")),
    popularity     = any_of(c("popularity", "track_popularity")),
    playlist_genre = any_of(c("genre", "playlist_genre")),
    playlist_subgenre = any_of(c("subgenre", "playlist_subgenre")),
    soundcloud_link = any_of(c("soundcloud_link", "links"))
  ) %>%
  filter(!is.na(track), !is.na(artist), !is.na(popularity))
glimpse(SONGS)
Rows: 14,987
Columns: 23
$ track             <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ artist            <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ lyrics            <chr> "the trees, are singing in the wind the sky blue, on…
$ album             <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ danceability      <dbl> 0.2166860, 0.8447277, 0.3580533, 0.7462341, 0.440324…
$ energy            <dbl> 0.8779620, 0.6460897, 0.3674362, 0.8850809, 0.632868…
$ key               <dbl> 0.81818182, 0.54545455, 0.45454545, 0.81818182, 0.54…
$ loudness          <dbl> 0.7817377, 0.6813893, 0.7425419, 0.8813965, 0.730275…
$ mode              <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1…
$ speechiness       <dbl> 0.02434122, 0.21616793, 0.01306387, 0.02065654, 0.03…
$ acousticness      <dbl> 0.011792960, 0.004353434, 0.694556021, 0.037297028, …
$ instrumentalness  <dbl> 0.010205339, 0.007422998, 0.000000000, 0.000000000, …
$ liveness          <dbl> 0.34221195, 0.48613476, 0.05781237, 0.13038190, 0.08…
$ valence           <dbl> 0.4080748, 0.6565622, 0.4090849, 0.2424166, 0.308073…
$ tempo             <dbl> 0.5545093, 0.4227024, 0.4605076, 0.5250801, 0.625378…
$ language          <chr> "en", "en", "en", "en", "en", "en", "en", "en", "es"…
$ sentiment         <chr> "Positive", "Positive", "Positive", "Negative", "Pos…
$ song_artist       <chr> "i feel alive steady rollin", "poison bell biv devoe…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

The dataset was obtained from Hugging Face and contains Spotify song metadata linked to SoundCloud URLs. It is appropriate for this analysis because it combines popularity metrics, playlist information, and detailed audio features while also enabling direct reference to SoundCloud content. It consists of 14,987 observations and 23 variables, including track-level metadata, playlist attributes, audio features, sentiment labels, and SoundCloud links, making it well suited for exploratory analysis of music popularity.

Data Cleaning

Now let's clean the data by keeping only the columns needed for the playlist analysis.

PLAYLIST_TABLE <- SONGS %>%
  transmute(
    playlist_name   = playlist_name,
    artist_name     = artist,
    track_name      = track,
    album_name      = album,
    popularity      = popularity,
    playlist_genre  = playlist_genre,
    playlist_subgenre = playlist_subgenre,
    soundcloud_link = soundcloud_link
  )

glimpse(PLAYLIST_TABLE)
Rows: 14,987
Columns: 8
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ artist_name       <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ track_name        <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ album_name        <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

After ingestion, the dataset was reduced to the variables most relevant to playlist behavior and song popularity. The resulting table contains 14,987 observations and 8 variables: playlist name, artist name, track name, album name, popularity score, playlist genre, playlist subgenre, and the corresponding SoundCloud link. Rows with a missing track, artist, or popularity value were already removed during ingest, so all visualizations and comparisons rely on complete observations. This streamlined structure combines playlist context, song metadata, and platform linkage, making it well suited for exploratory analysis of how playlist placement relates to popularity.
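A minimal sketch of how one might check what the missing-value filter actually removes. It runs on a small made-up table (`songs_demo` is a hypothetical stand-in, not the real `SONGS` data), so the counts it produces say nothing about the actual dataset.

```r
library(dplyr)

# Hypothetical stand-in for SONGS; the real table has 14,987 rows.
songs_demo <- tibble(
  track      = c("song a", "song b", NA, "song d"),
  artist     = c("artist 1", "artist 2", "artist 3", NA),
  popularity = c(28, NA, 41, 65)
)

# Count missing values in each key column.
na_counts <- songs_demo %>%
  summarise(across(c(track, artist, popularity), ~ sum(is.na(.x))))

# Apply the same filter as the ingest step and keep only complete rows.
songs_complete <- songs_demo %>%
  filter(!is.na(track), !is.na(artist), !is.na(popularity))

na_counts
nrow(songs_demo) - nrow(songs_complete)  # number of rows dropped by the filter
```

Running the same two steps on `SONGS_raw` (after renaming) would show whether any rows were dropped at all, which is worth reporting alongside the final row count.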

Data Exploration

We define a “popular song” as one with a popularity that is greater than or equal to 70.

ppop_threshold <- 70
ppop_threshold
[1] 70
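The threshold can then be turned into a flag on each track. The following is a small sketch under stated assumptions: `songs_demo` is made-up data, and `pop_cutoff` and `is_popular` are names introduced here for illustration only.

```r
library(dplyr)

pop_cutoff <- 70  # same cutoff as in the text; the variable name is ours

# Hypothetical tracks with popularity scores on the 0-100 scale.
songs_demo <- tibble(
  track      = c("a", "b", "c", "d", "e"),
  popularity = c(28, 70, 41, 81, 69)
)

# Flag popular tracks (>= cutoff) and summarise how common they are.
popular_summary <- songs_demo %>%
  mutate(is_popular = popularity >= pop_cutoff) %>%
  summarise(
    n_tracks      = n(),
    n_popular     = sum(is_popular),
    share_popular = mean(is_popular)
  )

popular_summary
```

Using `>=` rather than `>` matters at the boundary: a track with a score of exactly 70 counts as popular, while 69 does not.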
track_counts <- PLAYLIST_TABLE %>%
  distinct(playlist_name, track_name, artist_name, popularity) %>%
  count(track_name, artist_name, popularity, name = "playlist_appearances")

glimpse(track_counts)
Rows: 14,987
Columns: 4
$ track_name           <chr> "$20 fine", "$ave dat money (feat. fetty wap & ri…
$ artist_name          <chr> "jimi hendrix", "lil dicky", "max frost", "queen"…
$ popularity           <dbl> 44, 69, 43, 60, 0, 39, 83, 75, 50, 48, 55, 68, 5,…
$ playlist_appearances <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

To understand how song popularity relates to playlist exposure, I constructed a track-level summary table that aggregates playlist information across the dataset. Each row represents a unique combination of track, artist, and popularity score, along with the number of distinct playlists in which it appears. The summary has 14,987 rows, the same as the cleaned table, so every track appears in exactly one playlist regardless of its popularity score. Playlist appearances therefore do not vary in this dataset. The table still serves as the basis for the plot that follows, which checks whether more popular songs receive greater playlist exposure.
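To make the aggregation logic concrete, here is a sketch of the same `distinct()` then `count()` pattern on a tiny made-up table (`playlist_demo` is hypothetical). Unlike the real data, it deliberately includes one track that sits in two playlists and one exact duplicate row, so the effect of each step is visible.

```r
library(dplyr)

# Hypothetical playlist rows: "song a" is in two playlists,
# and the "song c" row is duplicated within playlist p3.
playlist_demo <- tibble(
  playlist_name = c("p1", "p2", "p1", "p3", "p3"),
  track_name    = c("song a", "song a", "song b", "song c", "song c"),
  artist_name   = c("x", "x", "y", "z", "z"),
  popularity    = c(80, 80, 40, 55, 55)
)

# distinct() drops the exact duplicate; count() tallies playlists per track.
demo_counts <- playlist_demo %>%
  distinct(playlist_name, track_name, artist_name, popularity) %>%
  count(track_name, artist_name, popularity, name = "playlist_appearances")

# How many tracks appear once, twice, and so on.
appearance_dist <- demo_counts %>%
  count(playlist_appearances, name = "n_tracks")

appearance_dist
```

Applying the final `count(playlist_appearances)` step to `track_counts` is a quick way to see the full distribution of appearances instead of reading it off the truncated `glimpse()` output.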

Popularity vs Playlist Appearances

ggplot(track_counts, aes(x = popularity, y = playlist_appearances)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Popularity vs Playlist Appearances",
    x = "Track Popularity",
    y = "Number of Playlist Appearances"
  ) +
  theme_minimal(base_size = 13)

This plot shows track popularity against the number of playlist appearances. Every track appears exactly once, because the SoundCloud-linked dataset records a single playlist association per track. Since playlist frequency does not vary, no relationship between popularity and playlist appearances can be inferred from this data. The flat pattern reflects a structural limitation of the dataset rather than a musical trend.
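The reason no relationship can be inferred is numerical as well as visual: a variable with zero variance has no defined correlation with anything. The sketch below illustrates this with a few made-up popularity values and a constant appearance count; it uses base R only.

```r
# Hypothetical values: popularity varies, appearances are all 1
# (mirroring the structure of track_counts, not its actual contents).
popularity           <- c(44, 69, 43, 60, 0, 39, 83, 75)
playlist_appearances <- rep(1L, length(popularity))

# A constant variable has zero standard deviation ...
sd(playlist_appearances)

# ... so the Pearson correlation is undefined. R returns NA and warns
# that the standard deviation is zero; the warning is suppressed here.
r <- suppressWarnings(cor(popularity, playlist_appearances))
is.na(r)
```

A quick `sd()` or `n_distinct()` check on the response variable before plotting would flag this kind of structural limitation early.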

Most danceable songs

SONGS %>%
  arrange(desc(danceability)) %>%
  select(track, artist, danceability, popularity, soundcloud_link) %>%
  slice_head(n = 5)
# A tibble: 5 × 5
  track                           artist danceability popularity soundcloud_link
  <chr>                           <chr>         <dbl>      <dbl> <chr>          
1 ice ice baby                    vanil…        1             70 http://soundcl…
2 cha cha slide - original live … dj ca…        0.999         54 http://soundcl…
3 funky friday                    dave          0.995         72 http://soundcl…
4 bad bad bad (feat. lil baby)    young…        0.994         81 http://soundcl…
5 cinnamon girl - radio edit      [dunk…        0.994         47 http://soundcl…

This table lists the five most danceable songs in the dataset, ranked by Spotify's danceability score (normalized to the 0 to 1 range in this dataset). All five score at or above 0.994, indicating strong rhythm and suitability for dancing, yet their popularity ranges from 47 to 81. High danceability therefore does not guarantee high popularity on its own, which is consistent with popularity depending on multiple musical and contextual factors.

Danceability vs Popularity

ggplot(SONGS, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Danceability vs Popularity",
    x = "Danceability",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

This scatter plot reveals a weak but positive relationship between danceability and popularity. Although more danceable songs tend to have slightly higher popularity on average, the wide dispersion of points indicates that danceability is not a strong standalone predictor of a song’s popularity.
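One way to put a number on "weak but positive" is to report a correlation coefficient and the share of variance explained by a straight-line fit. The sketch below does this on simulated data (`songs_demo` is hypothetical and generated with an arbitrary weak trend), so its values are not results for the real dataset.

```r
library(dplyr)

set.seed(1)

# Simulated stand-in: a weak positive trend plus a lot of noise,
# with popularity clamped to the 0-100 range.
songs_demo <- tibble(
  danceability = runif(200),
  popularity   = pmin(100, pmax(0, 40 + 10 * danceability + rnorm(200, sd = 20)))
)

# Linear (Pearson) and rank-based (Spearman) correlation.
pearson  <- cor(songs_demo$danceability, songs_demo$popularity)
spearman <- cor(songs_demo$danceability, songs_demo$popularity,
                method = "spearman")

# Share of variance in popularity explained by a simple linear fit.
fit       <- lm(popularity ~ danceability, data = songs_demo)
r_squared <- summary(fit)$r.squared

c(pearson = pearson, spearman = spearman, r_squared = r_squared)
```

For a single predictor, R-squared equals the squared Pearson correlation, so a correlation near 0.1 or 0.2 corresponds to only a few percent of variance explained. Replacing `songs_demo` with `SONGS` would give the figures for the actual data.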

Tempo vs Popularity

ggplot(SONGS, aes(x = tempo, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Tempo vs Popularity",
    x = "Tempo",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

The relationship between tempo and popularity appears weak: the LOESS-smoothed trend line is relatively flat. Popularity remains fairly stable across a wide range of tempo values, with no clear tempo range dominating popular songs. This suggests that tempo alone does not play a major role in a track's popularity.
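A simple numerical companion to the smoothed line is to split tempo into equal-sized bins and compare the median popularity in each. This is a sketch on simulated data (`songs_demo` is hypothetical, and the choice of five bins is arbitrary), intended only to show the pattern of the calculation.

```r
library(dplyr)

set.seed(2)

# Simulated stand-in: normalized tempo in [0, 1] and unrelated popularity.
songs_demo <- tibble(
  tempo      = runif(300),
  popularity = round(runif(300, min = 0, max = 100))
)

# Split tracks into five equal-sized tempo bins and summarise each bin.
tempo_bins <- songs_demo %>%
  mutate(tempo_bin = ntile(tempo, 5)) %>%
  group_by(tempo_bin) %>%
  summarise(
    n                 = n(),
    tempo_min         = min(tempo),
    tempo_max         = max(tempo),
    median_popularity = median(popularity),
    .groups = "drop"
  )

tempo_bins
```

If the medians are similar across bins, that supports the reading of a flat trend; a clear peak in one bin would point to a tempo range worth a closer look.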

Conclusion:

This analysis examined whether track popularity is associated with playlist inclusion, as part of the broader question of what influences song popularity across major music streaming platforms. With this SoundCloud-linked Spotify dataset, that question cannot be answered directly: every track appears in exactly one playlist regardless of its popularity score, so playlist appearances do not vary and no association with popularity can be measured. Among the audio features, danceability and energy show weak positive relationships with popularity, suggesting that rhythmically engaging and moderately high-energy songs are somewhat more likely to be popular, but these effects are modest and highly variable. Popular songs also tend to cluster around mid-to-high valence levels, indicating a preference for emotionally positive or balanced tracks rather than extreme moods, while tempo shows little systematic relationship with popularity. Taken together, these findings suggest that song popularity reflects a combination of musical characteristics and platform dynamics rather than any single factor, highlighting the multifaceted nature of music consumption and discovery across streaming platforms.

Future Work

There are several ways this project could be extended to more fully address the factors that influence song popularity. First, incorporating richer playlist data—such as playlist follower counts, track order within playlists, and repeated appearances across multiple playlists—would allow for a more nuanced analysis of how playlist exposure impacts popularity. Second, integrating time-based information (e.g., release year, time since release, or changes in popularity over time) could help distinguish between long-term popularity and short-term viral success. Additionally, expanding the dataset to include direct SoundCloud engagement metrics, such as likes, reposts, or comments, would strengthen cross-platform comparisons. Finally, future analyses could move beyond exploratory visualization by incorporating predictive models or resampling-based inference to quantify the relative importance of different audio features and playlist dynamics, providing deeper insight into the mechanisms that drive song popularity on streaming platforms.
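As one illustration of the resampling-based inference mentioned above, the sketch below computes a percentile bootstrap confidence interval for the correlation between danceability and popularity. It is a minimal example under stated assumptions: the data are simulated, the 1,000 resamples and the 95% level are arbitrary choices, and none of the output describes the real dataset.

```r
set.seed(3)

# Simulated stand-in for two columns of the song table.
n            <- 150
danceability <- runif(n)
popularity   <- 40 + 10 * danceability + rnorm(n, sd = 20)

# Resample rows with replacement and recompute the correlation each time.
boot_cor <- replicate(1000, {
  idx <- sample.int(n, replace = TRUE)
  cor(danceability[idx], popularity[idx])
})

# Percentile bootstrap interval for the correlation.
ci <- quantile(boot_cor, probs = c(0.025, 0.975))
ci
```

An interval that includes zero would indicate that the observed weak association could plausibly be noise, while a narrow interval away from zero would support a small but real effect.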